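Over the past decades, scene text recognition has attracted worldwide attention from both academia and practitioners because of its wide range of applications. Despite the achievements of optical character recognition, scene text recognition remains challenging due to inherent problems such as distortion and irregular layout. Most existing methods rely on recurrent or convolutional neural networks. However, recurrent neural networks (RNNs) usually train slowly because of their sequential computation and suffer from vanishing gradients or bottlenecks, while CNNs face a trade-off between complexity and performance. In this paper, we introduce SAFL, a self-attention-based neural network model with focal loss for scene text recognition, to overcome the limitations of existing methods. Using focal loss instead of negative log-likelihood helps the model focus more on low-frequency samples during training. In addition, to handle distorted and irregular text, we use a Spatial Transformer Network (STN) to rectify the text before passing it to the recognition network. We conduct experiments comparing the proposed model with seven benchmarks. The numerical results show that our model achieves the best performance.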
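As a rough illustration of why focal loss shifts attention toward hard or rare samples, the sketch below compares it with negative log-likelihood for a single prediction. It uses the standard focal loss form, -(1 - p_t)^gamma * log(p_t); the abstract does not state the exact formulation or hyperparameters used in SAFL, so the function names, the `eps` clamp, and `gamma=2.0` are assumptions made for this example only.

```python
import math


def nll_loss(probs, target):
    """Negative log-likelihood of the target class for one prediction."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0, eps=1e-12):
    """Standard focal loss for one prediction: -(1 - p_t)^gamma * log(p_t).

    gamma=2.0 is an assumed default, not a value taken from the abstract.
    eps only guards against log(0).
    """
    p_t = max(probs[target], eps)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy sample (target predicted with p=0.9) and a hard one (p=0.1).
easy = [0.9, 0.1]
hard = [0.1, 0.9]

# Relative weight of the hard sample versus the easy one under each loss.
nll_ratio = nll_loss(hard, 0) / nll_loss(easy, 0)
focal_ratio = focal_loss(hard, 0) / focal_loss(easy, 0)
print(nll_ratio < focal_ratio)
```

With `gamma=0` the focal loss reduces to plain negative log-likelihood; larger values of `gamma` down-weight well-classified samples more strongly, so poorly predicted samples dominate the gradient.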
In this work, we propose a new approach that combines data from multiple sensors for reliable obstacle avoidance. The sensors include two depth cameras and a LiDAR, arranged so that they capture the whole 3D area in front of the robot and a 2D slice around it. To fuse the data from these sensors, we first use an external camera as a reference to combine the data from the two depth cameras. A projection technique is then introduced to convert the cameras' 3D point cloud data to its 2D correspondence. An obstacle avoidance algorithm is then developed based on the dynamic window approach. A number of experiments have been conducted to evaluate the proposed approach. The results show that the robot can effectively avoid static and dynamic obstacles of different shapes and sizes in different environments.
Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text. Despite considerable research efforts in recent years for English text, the task of ED in other languages has been significantly less explored. Switching to non-English languages, important research questions for ED include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages. To answer those questions, it is crucial to obtain multilingual ED datasets that provide consistent event annotation for multiple languages. There exist some multilingual ED datasets; however, they tend to cover a handful of languages and mainly focus on popular ones. Many languages are not covered in existing multilingual ED datasets. In addition, the current datasets are often small and not accessible to the public. To overcome those shortcomings, we introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages; 5 of them have not been supported by existing multilingual datasets. We also perform extensive experiments and analysis to demonstrate the challenges and transferability of ED across languages in MINION that in all call for more research effort in this area.
Event Extraction (EE) is one of the fundamental tasks in Information Extraction (IE) that aims to recognize event mentions and their arguments (i.e., participants) from text. Due to its importance, extensive methods and resources have been developed for Event Extraction. However, one limitation of current EE research is that non-English languages remain under-explored, with the lack of high-quality multilingual EE datasets for model training and evaluation being the main hindrance. To address this limitation, we propose a novel Multilingual Event Extraction dataset (MEE) that provides annotation for more than 50K event mentions in 8 typologically different languages. MEE comprehensively annotates data for entity mentions, event triggers and event arguments. We conduct extensive experiments on the proposed dataset to reveal challenges and opportunities for multilingual EE.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
Robots have been brought to work close to humans in many scenarios. For coexistence and collaboration, robots should be safe and pleasant for humans to interact with. To this end, robots could be both physically soft and equipped with multimodal sensing/perception, so that they have better awareness of the surrounding environment and can respond properly to humans' actions and intentions. This paper introduces a novel soft robotic link, named ProTac, that possesses multiple sensing modes: tactile and proximity sensing, based on computer vision and a functional material. These modalities come from a layered structure of a soft transparent silicon skin, a polymer dispersed liquid crystal (PDLC) film, and reflective markers. Here, the PDLC film can switch actively between the opaque and the transparent state, from which tactile sensing and proximity sensing can be obtained using only cameras built inside the ProTac link. In this paper, inference algorithms for tactile and proximity perception are introduced. Evaluation results for the two sensing modalities demonstrate that, with a simple activation strategy, the ProTac link can effectively perceive useful information from both approaching and in-contact obstacles. The proposed sensing device is expected to bring in ultimate solutions for the design of robots with softness, whole-body and multimodal sensing, and safety control strategies.
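Streaming video is one of the ways creators share their creative work with an audience. In these videos, streamers show how they achieve their final goal by using various tools in one or more programs for creative projects. To this end, the steps required to reach the final goal may be discussed. These videos can therefore provide substantial educational content for learning how to use the tools the streamer uses. One drawback, however, is that the streamer may not provide enough detail for every step, so learners may find it difficult to keep up with all of the steps. To alleviate this problem, one solution is to link streaming videos to relevant tutorials available for the tools used in them. More specifically, a system can analyze the content of a live-streamed video and recommend the most relevant tutorials. Since existing document recommendation models cannot handle this setting, in this work we present a novel dataset and model for tutorial recommendation for live-streamed videos. We conduct extensive analyses of the proposed dataset and model, revealing the challenging nature of this task.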
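Keyphrase extraction is one of the essential tasks for document understanding in NLP. While most prior work is dedicated to formal settings such as books, news, or web blogs, informal texts such as video transcripts are less explored. To address this limitation, in this work we present a novel corpus and method for keyphrase extraction from the transcripts of videos streamed on the Behance platform. More specifically, a novel data augmentation is proposed to enrich the model with background knowledge about the keyphrase extraction task from other domains. Extensive experiments on the proposed dataset show the effectiveness of the introduced method.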
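This paper discusses a facial expression recognition model and a description generation model for building descriptive sentences about images and the facial expressions of the people in them. Our study shows that YOLOv5 achieves better results than a traditional CNN for all emotions on the KDEF dataset. In particular, the accuracies of the CNN and YOLOv5 models are 0.853 and 0.938, respectively. A model for image description based on a merged architecture is proposed, using VGG16 with the descriptions encoded by an LSTM model. YOLOv5 is also used to recognize the dominant colors of objects in the image and, where necessary, to correct the color words in the generated descriptions. If a description contains words referring to a person, we recognize the emotion of the person in the image. Finally, we combine the results of all models to create sentences that describe both the visual content and the human emotions in the image. Experimental results on the Flickr8k dataset in Vietnamese achieve BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 0.628, 0.425, 0.280, and 0.174, respectively.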
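In this paper, we introduce a high-quality, large-scale benchmark dataset for English-Vietnamese speech translation with 508 audio hours, consisting of 331K triplets of (sentence-length audio, English source transcript sentence, Vietnamese target subtitle sentence). We also conduct empirical experiments with strong baselines and find that the traditional "cascaded" approach still outperforms the modern "end-to-end" approach. To the best of our knowledge, this is the first large-scale English-Vietnamese speech translation study. We hope that both our publicly available dataset and our study can serve as a starting point for future research and applications in English-Vietnamese speech translation. Our dataset is available at https://github.com/vinairesearch/phost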